 parallel text


Intertextual Parallel Detection in Biblical Hebrew: A Transformer-Based Benchmark

arXiv.org Artificial Intelligence

Identifying parallel passages in biblical Hebrew (BH) is central to biblical scholarship for understanding intertextual relationships. Traditional methods rely on manual comparison, a labor-intensive process prone to human error. This study evaluates the potential of pre-trained transformer-based language models, including E5, AlephBERT, MPNet, and LaBSE, for detecting textual parallels in the Hebrew Bible. Focusing on known parallels between Samuel/Kings and Chronicles, I assessed each model's capability to generate word embeddings distinguishing parallel from non-parallel passages. Using cosine similarity and Wasserstein Distance measures, I found that E5 and AlephBERT show promise; E5 excels in parallel detection, while AlephBERT demonstrates stronger non-parallel differentiation. These findings indicate that pre-trained models can enhance the efficiency and accuracy of detecting intertextual parallels in ancient texts, suggesting broader applications for ancient language studies.
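As a rough illustration of the two measures named in the abstract, the sketch below scores toy embedding pairs with cosine similarity and then compares the score distributions for parallel and non-parallel pairs with a one-dimensional Wasserstein distance. The vectors are invented for illustration, and the way the two measures are combined here is an assumption; the abstract does not say how the study applied them.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def wasserstein_1d(xs, ys):
    """1-D Wasserstein-1 distance between two equal-size samples.

    For equal-size empirical samples this reduces to the mean absolute
    difference between the sorted values.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal-size samples")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

# Toy 3-dimensional "embeddings" (made up; real models emit hundreds of dims).
parallel_pairs = [([1.0, 0.0, 1.0], [0.9, 0.1, 1.0]),
                  ([0.0, 1.0, 1.0], [0.1, 1.0, 0.8])]
non_parallel_pairs = [([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),
                      ([1.0, 1.0, 0.0], [0.0, 0.0, 1.0])]

parallel_scores = [cosine_similarity(u, v) for u, v in parallel_pairs]
non_parallel_scores = [cosine_similarity(u, v) for u, v in non_parallel_pairs]

# A larger distance means the two score distributions are easier to separate.
separation = wasserstein_1d(parallel_scores, non_parallel_scores)
print(parallel_scores, non_parallel_scores, separation)
```

Under this reading, a model whose parallel and non-parallel similarity scores sit far apart would be the more useful detector.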


An Interactive UI to Support Sensemaking over Collections of Parallel Texts

arXiv.org Artificial Intelligence

Scientists and science journalists, among others, often need to make sense of a large number of papers and how they compare with each other in scope, focus, findings, or any other important factors. However, with a large corpus of papers, it's cognitively demanding to pairwise compare and contrast them all with each other. Fully automating this review process would be infeasible, because it often requires domain-specific knowledge, as well as understanding what the context and motivations for the review are. While there are existing tools to help with the process of organizing and annotating papers for literature reviews, at the core they still rely on people to serially read through papers and manually make sense of relevant information. We present AVTALER, which combines people's unique skills, contextual awareness, and knowledge, together with the strength of automation. Given a set of comparable text excerpts from a paper corpus, it supports users in sensemaking and contrasting paper attributes by interactively aligning text excerpts in a table so that comparable details are presented in a shared column. AVTALER is based on a core alignment algorithm that makes use of modern NLP tools. Furthermore, AVTALER is a mixed-initiative system: users can interactively give the system constraints which are integrated into the alignment construction process.


Can AI help to increase access to all languages?

#artificialintelligence

Languages are the main medium of communication, but more than 7,100 of them are spoken around the world. People who live in different parts of the world speak different languages, and it's sometimes hard to communicate with people who don't speak our language. This hinders relationships between people and makes it hard to understand one another or build trust. The ability to translate language, then, makes it easier to communicate across borders and makes information more accessible. With advances in technology and artificial intelligence, online translators such as Google Translate, DeepL, and Bing Translate have made communication a lot easier among those speaking different languages.


Multilingual Transformer Encoders: a Word-Level Task-Agnostic Evaluation

arXiv.org Artificial Intelligence

Some Transformer-based models can perform cross-lingual transfer learning: those models can be trained on a specific task in one language and give relatively good results on the same task in another language, despite having been pre-trained on monolingual tasks only. But there is no consensus yet on whether those Transformer-based models learn universal patterns across languages. We propose a word-level task-agnostic method to evaluate the alignment of contextualized representations built by such models. We show that our method provides more accurate translated word pairs than previous methods to evaluate word-level alignment. Our results also show that some inner layers of multilingual Transformer-based models outperform other explicitly aligned representations, and even more so according to a stricter definition of multilingual alignment.


Improve Sentence Alignment by Divide-and-conquer

arXiv.org Artificial Intelligence

In this paper, we introduce a divide-and-conquer algorithm to improve sentence alignment speed. We utilize external bilingual sentence embeddings to find accurate hard delimiters for the parallel texts to be aligned. We use Monte Carlo simulation to show experimentally that using this divide-and-conquer algorithm, we can turn any quadratic time complexity sentence alignment algorithm into an algorithm with average time complexity of O(N log N). On a standard OCR-generated dataset, our method improves the Bleualign baseline by 3 F1 points. Besides, when computational resources are restricted, our algorithm is faster than Vecalign in practice.
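The divide-and-conquer idea can be sketched as follows. This is a minimal illustration and not the paper's algorithm: `sim` stands in for a bilingual sentence-embedding similarity, `greedy_align` is a hypothetical quadratic base aligner, and the rule for choosing a hard delimiter (the middle source sentence, accepted only if its best target match clears `threshold`) is an assumption.

```python
def greedy_align(src, tgt, sim):
    """Hypothetical quadratic base aligner: monotone greedy 1-1 links."""
    links, next_j = [], 0
    for i, s in enumerate(src):
        if next_j >= len(tgt):
            break
        # Pick the best remaining target sentence for this source sentence.
        j = max(range(next_j, len(tgt)), key=lambda k: sim(s, tgt[k]))
        links.append((i, j))
        next_j = j + 1
    return links

def align(src, tgt, sim, base_align=greedy_align,
          threshold=0.9, min_size=4, s_off=0, t_off=0):
    """Split at confident anchor pairs, then align each small segment."""
    def run_base():
        return [(i + s_off, j + t_off) for i, j in base_align(src, tgt, sim)]

    if len(src) <= min_size or len(tgt) <= min_size:
        return run_base()

    # Candidate hard delimiter: middle source sentence and its best match.
    i = len(src) // 2
    j = max(range(len(tgt)), key=lambda k: sim(src[i], tgt[k]))
    if sim(src[i], tgt[j]) < threshold:
        return run_base()  # no confident delimiter; fall back

    left = align(src[:i], tgt[:j], sim, base_align,
                 threshold, min_size, s_off, t_off)
    right = align(src[i + 1:], tgt[j + 1:], sim, base_align,
                  threshold, min_size, s_off + i + 1, t_off + j + 1)
    return left + [(i + s_off, j + t_off)] + right

# Toy data: sentence "sK" is the translation of "tK".
src = [f"s{k}" for k in range(10)]
tgt = [f"t{k}" for k in range(10)]
toy_sim = lambda a, b: 1.0 if a[1:] == b[1:] else 0.0
links = align(src, tgt, toy_sim)
print(links)
```

When confident delimiters are found at each level, the quadratic aligner only ever runs on segments of at most `min_size` sentences, which is where the speed-up in this sketch comes from.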


From Words to Sentences: A Progressive Learning Approach for Zero-resource Machine Translation with Visual Pivots

arXiv.org Artificial Intelligence

The neural machine translation model has suffered from the lack of large-scale parallel corpora. In contrast, we humans can learn multi-lingual translations even without parallel texts by referring our languages to the external world. To mimic such human learning behavior, we employ images as pivots to enable zero-resource translation learning. However, a picture tells a thousand words, which makes multi-lingual sentences pivoted by the same image noisy as mutual translations and thus hinders the translation model learning. In this work, we propose a progressive learning approach for image-pivoted zero-resource machine translation. Since words are less diverse when grounded in the image, we first learn word-level translation with image pivots, and then progress to learn the sentence-level translation by utilizing the learned word translation to suppress noise in image-pivoted multi-lingual sentences. Experimental results on two widely used image-pivot translation datasets, IAPR-TC12 and Multi30k, show that the proposed approach significantly outperforms other state-of-the-art methods.


Bible is providing data to help create AI that can convert texts

Daily Mail - Science & tech

Scientists are now using the Bible to help algorithms perfect their language skills. An AI has been trained on various versions of the sacred text so it can convert written works into different styles for different audiences. Each version of the Bible contains more than 31,000 verses that the researchers used to produce over 1.5 million unique pairings of source and target verses. Internet tools that translate text between languages like English and Spanish are widely available.


Artificial intelligence goes bilingual--without a dictionary - Nova Languages

#artificialintelligence

Automatic language translation has come a long way, thanks to neural networks--computer algorithms that take inspiration from the human brain. But training such networks requires an enormous amount of data: millions of sentence-by-sentence translations to demonstrate how a human would do it. Now, two new papers show that neural networks can learn to translate with no parallel texts--a surprising advance that could make documents in many languages more accessible. "Imagine that you give one person lots of Chinese books and lots of Arabic books--none of them overlapping--and the person has to learn to translate Chinese to Arabic. That seems impossible, right?" says the first author of one study, Mikel Artetxe, a computer scientist at the University of the Basque Country (UPV) in San Sebastián, Spain.


For The First Time, AI Can Teach Itself Any Language On Earth

#artificialintelligence

To understand the potential of these new systems, it helps to know how current machine translation works. The current de facto standard is Google Translate, a system that covers 103 languages from Afrikaans to Zulu, including the top 10 languages in the world: in order, Mandarin, Spanish, English, Hindi, Bengali, Portuguese, Russian, Japanese, German, and Javanese. Google's system uses human-supervised neural networks that compare parallel texts, that is, books and articles that have been previously translated by humans. By comparing extremely large amounts of these parallel texts, Google Translate learns the equivalences between any two given languages, thus acquiring the ability to quickly translate between them. Sometimes the translations are funny or don't really capture the original meaning but, in general, they are functional and, over time, they're getting better and better.


Artificial Intelligence Goes Bilingual--Without a Dictionary

#artificialintelligence

Researchers at the University of the Basque Country (UPV) in Spain and at Facebook have separately developed unsupervised machine-learning techniques for teaching neural networks to translate between languages with no parallel texts. Both methods use back translation and denoising as training strategies. In back translation, a sentence in one language is approximately translated into the other, then translated back into the original language, with the networks adjusted to make subsequent attempts closer to identical. Denoising, meanwhile, adds noise to a sentence by rearranging or removing words, and the network attempts to translate the result back into the original. The UPV method translates more frequently during training, while the Facebook technique, in addition to encoding a sentence from one language into a more abstract representation before decoding it into the other language, also confirms the intermediate language is truly abstract.
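The noise step used in denoising can be sketched as a small function that removes some words and lightly rearranges the rest. The parameter names `drop_prob` and `max_shift`, their default values, and the jittered-position shuffle are all assumptions made for illustration; the summary above does not specify the noise model either group used.

```python
import random

def add_noise(words, drop_prob=0.1, max_shift=3, rng=None):
    """Return a noisy copy of `words`: some removed, the rest locally shuffled."""
    rng = rng or random.Random()

    # Removal: drop each word independently with probability drop_prob.
    kept = [w for w in words if rng.random() >= drop_prob]
    if not kept and words:
        kept = [rng.choice(words)]  # never return an empty sentence

    # Rearrangement: jitter each position by less than max_shift and re-sort,
    # so words move only a short distance from where they started.
    keys = [i + rng.uniform(0, max_shift) for i in range(len(kept))]
    return [w for _, w in sorted(zip(keys, kept), key=lambda pair: pair[0])]

sentence = "the cat sat on the mat".split()
noisy = add_noise(sentence, rng=random.Random(0))
print(noisy)
```

During training, the network would receive the noisy sentence as input and be asked to reproduce the clean one, which is the reconstruction task the denoising strategy describes.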